Introduction to Open Data Science - Course Project

Chapter 1: Start me up!

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Thu Nov 25 11:49:11 2021"

1.1 Some thoughts

I’m really excited to start this course. I expect to learn a lot about using R and RStudio.

Chapter 2: Regression and model validation

date()
## [1] "Thu Nov 25 11:49:11 2021"

The second week in short:

2.1 Data

The original data was collected in 2014/2015 from students participating in the Introduction to Social Statistics course. For more information about the data and its variables, see the data description.

# read data from local file
learning2014 <- read.table("data/learning2014.csv")
# examine the structure and dimensions of the data
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014) 
## [1] 166   7

After data wrangling, the data set utilized here consists of 166 observations of 7 variables (i.e. 166 rows and 7 columns), meaning that the data provides information from 166 respondents on seven variables. Deep, stra and surf are combination variables based on multiple questions measuring deep, strategic and surface learning (for more information about the variables, see here). My code for the preparation of the data can be found in my repository. Next, I’ll present an overview of the data and its variables.

2.2 Overview of the data

# summary of the variables in the data
summary(learning2014)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :14.00   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :32.00   Median :3.667  
##                     Mean   :25.51   Mean   :31.43   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :50.00   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00
# show summary of the gender variable when it is transformed to factor
summary(as.factor(learning2014$gender))
##   F   M 
## 110  56
# access to libraries ggplot2 and GGally
## if not installed yet, use: install.packages("name_of_package")
library(GGally)
library(ggplot2)

# scatter plot matrix using ggplot2 and GGally -packages
ggpairs(learning2014, lower = list(combo = wrap("facethist", bins = 20)))

## note: if the scatter plot matrix is drawn with the base R function pairs(), the gender variable needs to be transformed to a factor first (in the DataCamp exercise gender is a factor, not a character)
# pairs(learning2014[-1], col = as.factor(learning2014$gender))

The summary tables above show a summary of each variable in the data. Gender is a categorical (‘character’) variable, while the others are continuous (‘numeric’). For the numeric variables the summary provides their minimum, maximum, mean and median values as well as the lower (1st) and upper (3rd) quartiles of their distributions. As the gender variable is a ‘character’, the summary shows only the total number of observations; when the variable is transformed to a factor, the summary function shows the numbers of female and male respondents.

The distributions of all the variables are presented graphically in the scatter plot matrix, which also provides the correlations between the variables. Here I focus only on the last column of the matrix, which shows the correlations between ‘points’ and the other continuous variables: the highest correlation coefficient is found between points and attitude (0.437), followed by stra (0.146) and surf (-0.144), while the correlation coefficients between points and age (-0.093) and between points and deep (-0.010) are very low. According to the scatter plot matrix, only the correlation between points and attitude is statistically significant (p < 0.001). It is also worth noticing that there are statistically significant correlations between surf and attitude, surf and deep, and surf and stra, indicating a potential problem of multicollinearity if those variables were used simultaneously as explanatory variables in a linear regression model.
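The significance of the points–attitude correlation can also be checked by hand (my own sketch, using only the values reported above) with the usual t-test for a correlation coefficient, t = r√(n−2)/√(1−r²):

```r
# hand check of the points~attitude correlation (r = 0.437, n = 166)
r <- 0.437
n <- 166
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat                             # ~6.22
2 * pt(-abs(t_stat), df = n - 2)   # p-value well below 0.001
```

The resulting t-value matches, up to rounding, the t-value of attitude in the simple regression model of section 2.3.2.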

2.3 Linear regression modelling

2.3.1 First model

In the following linear regression model ‘points’ is the response variable and - on the grounds of the correlation coefficients - attitude, stra and surf are selected as explanatory variables.

model1 <- lm(points ~ surf + stra + attitude, data = learning2014)
summary(model1)
## 
## Call:
## lm(formula = points ~ surf + stra + attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## surf        -0.58607    0.80138  -0.731  0.46563    
## stra         0.85313    0.54159   1.575  0.11716    
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The results of the multiple linear regression model are shown in the summary table above. The F-statistic with a low p-value (p < 0.001) indicates that not all the regression coefficients in the model are zero, meaning that at least one of the variables included in the model has an actual statistical effect on the target variable (points). The multiple R-squared illustrates how well the model fits the data; here, the multiple R-squared (0.2074) shows that the model (i.e. the explanatory variables) explains around 21 % of the variation in the response variable. The summary of residuals (i.e. the differences between observed and predicted response values) provides information about the symmetry of the residual distribution. Residuals are dealt with in more detail in chapter 2.4 along with model diagnostics.
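As a sanity check (my own addition, using only the numbers in the summary above), the F-statistic is tied to the multiple R-squared by F = (R²/p) / ((1−R²)/(n−p−1)):

```r
# recover the F-statistic from the reported R-squared
r2 <- 0.2074   # multiple R-squared
n  <- 166      # number of observations
p  <- 3        # number of explanatory variables
(r2 / p) / ((1 - r2) / (n - p - 1))   # ~14.13, matching the summary
```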

The regression coefficient describes the relationship between an explanatory variable and the response variable; it illustrates the change in the response variable when the explanatory variable alters by one unit and the other variables stay constant. According to the coefficient table, the estimates of surf and stra are not statistically significant. Instead, the estimate of attitude is statistically significant with a very low p-value (p < 0.001), showing that when attitude increases by one unit, the points increase by 0.3395.

2.3.2 Second model

Based on the results from the multiple linear regression model presented above, I modify the model by excluding the variables whose regression coefficients are not statistically significant (p > 0.05). Namely, the variables surf and stra are dropped and only attitude is included in the model.

model2 <- lm(points ~ attitude, data = learning2014)
summary(model2)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The results from the simple linear regression analysis are similar to the results from the multivariable linear model discussed above. The multiple R-squared decreases a little, showing that the new model explains about 19 % of the variation in the response variable. The value of the multiple R-squared typically decreases when variables are taken out of the model (and increases when more variables are included), while the adjusted R-squared takes into account the number of explanatory variables in the model. The adjusted R-squared is practically the same in both models, around 19 %.
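The adjustment can be verified by hand from the reported values (my own check, not part of the original output), since adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1):

```r
# adjusted R-squared of model2 from the reported multiple R-squared
r2 <- 0.1906
n  <- 166   # number of observations
p  <- 1     # number of explanatory variables
1 - (1 - r2) * (n - 1) / (n - p - 1)   # ~0.1856, matching the summary up to rounding
```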

The coefficient table includes two estimates: the intercept and attitude. The estimate of the intercept term can be interpreted here as the expected value of the response variable (points) when the explanatory variable (attitude) is 0. This information can be utilized, for instance, if we want to predict the points based on known values of the explanatory variable. For example, if a person’s attitude is 25, we can predict their points as follows: 11.63715 + 0.35255 x 25 = 20.4509 (i.e. intercept + regression coefficient of attitude x value of attitude = points).
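The same prediction can be computed in R. The manual calculation below uses the reported coefficients; the commented predict() call shows how it would be done directly with the fitted model2 object from above:

```r
# manual prediction of points for a person with attitude = 25,
# using the coefficients reported in the summary of model2
intercept  <- 11.63715
b_attitude <- 0.35255
intercept + b_attitude * 25   # 20.4509

# equivalently, with the fitted model object:
# predict(model2, newdata = data.frame(attitude = 25))
```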

The estimate of attitude describes the relationship between attitude and points; the statistically significant (p < 0.001) regression coefficient of attitude (0.3526) shows that when attitude increases by one, the points increase by 0.3526 points.

2.4 Regression diagnostics

The key assumptions of linear regression model examined here are:

  • the relationship between the response and explanatory variables is linear
  • the errors are normally distributed
  • the errors have constant variance
  • the size of a given error does not depend on the explanatory variable

The validity of these assumptions can be explored by analyzing the residuals. The figure below consists of three diagnostic plots: 1) ‘Residuals vs. Fitted values’, 2) ‘Normal QQ-plot’ and 3) ‘Residuals vs. Leverage’.

par(mfrow = c(2, 2)) # to put the following figures in one 2x2 figure
plot(model2, which = c(1,2, 5))

In the ‘Residuals vs. Fitted’ plot the residuals are plotted against the fitted values of the response variable. The plot provides a graphical method to examine whether the errors have constant variance. A pattern in the scatter plot would indicate a problem with this assumption and imply that the size of the errors depends on the explanatory variables. Here, the residuals seem to be reasonably randomly spread above and below the line, suggesting that the assumptions of linearity and constant variance are valid.

The normal QQ-plot of the residuals helps to explore whether the errors are normally distributed. In an ideal case the residuals would match the diagonal line perfectly, and in practice the better the points follow the line, the stronger the evidence for the normality assumption. In the QQ-plot above, the residuals fit the line reasonably well, although a slight curve is visible as the lower and upper tails deviate slightly from the diagonal line.

The ‘Residuals vs. Leverage’ plot describes how large an influence a single observation has on the model and thus helps to identify whether some observations have an unusually strong impact on determining the regression line. Observations falling outside the Cook’s distance line (the dotted red line) are considered highly influential to the model fit (as they have a large residual and high leverage), meaning that the results would change considerably if the observation were removed from the data. In the figure above the Cook’s distance line does not even appear in the plot, indicating that none of the observations has an unusually large impact on the results.


Chapter 3: Logistic regression

date()
## [1] "Thu Nov 25 11:49:18 2021"

My third week in short

3.1 Data

# access to all packages needed
library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)
library(ggpubr)
library(boot)

# read the prepared data set from local file
alc <- read.table("data/alc.csv", sep = ";")

# examine the data
dim(alc)
## [1] 370  35
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

In this exercise the data is drawn from the Student Performance Data set, which includes information on student achievement in secondary education at two Portuguese schools. After data wrangling, the data set used here consists of 370 observations (rows) of 35 variables (columns), that is, information from 370 respondents regarding 35 variables. The names of the variables included in the prepared data set are shown above, and a more detailed description of them is presented in the description of the Student Performance Data.

3.2 Selected variables and hypotheses

After examining the variables in the data, I chose the following four variables for further examination, as I assume they could be associated with the level of alcohol consumption:

  1. sex; student’s gender (F = female, M = male)
  2. romantic; with a romantic relationship (yes or no)
  3. studytime; weekly study time (numeric: 1 = less than 2 hours, 2 = 2 to 5 hours, 3 = 5 to 10 hours, 4 = over 10 hours)
  4. goout; going out with friends (from 1 (very low) to 5 (very high))

My hypotheses are:

  1. Male students are more likely to be high users of alcohol compared to females
  2. Students with a romantic relationship are less likely to be high users compared to those who are not in a relationship
  3. Students who spend more time studying are less likely to be high users of alcohol than those who study less
  4. Students who go out with their friends more often are more likely to be high users of alcohol than the students who go out less often

3.2.1 Overview of the selected variables

Below I present the summary tables as well as plots of each variable of interest.

# before examining the variables I'll transform the two character variables of interest to factors
alc <- mutate(alc, sex = as.factor(sex), romantic = as.factor(romantic))

# pick the names of the variables of interest
varnames <- select(alc, sex, romantic, studytime, goout, high_use) %>%
  colnames()

# summary of each variable ('all_of' is used because varnames is a character vector)
select(alc, all_of(varnames)) %>%
  summary()
##  sex     romantic    studytime         goout        high_use      
##  F:195   no :251   Min.   :1.000   Min.   :1.000   Mode :logical  
##  M:175   yes:119   1st Qu.:1.000   1st Qu.:2.000   FALSE:259      
##                    Median :2.000   Median :3.000   TRUE :111      
##                    Mean   :2.043   Mean   :3.116                  
##                    3rd Qu.:2.000   3rd Qu.:4.000                  
##                    Max.   :4.000   Max.   :5.000
# a plot of each variable
gather(alc[varnames]) %>% ggplot(aes(value)) + 
  facet_wrap("key", scales = "free") +
  geom_bar()

3.2.2 Alcohol consumption and selected variables

Next, I examine how the selected variables are related to alcohol consumption. First, all variables of interest are tabulated against the ‘high use of alcohol’ variable, after which the figure shows the proportional distribution of high users for each variable.

# cross-tabulations with high_use:

# sex and high_use
addmargins(table(alc$sex, alc$high_use))
##      
##       FALSE TRUE Sum
##   F     154   41 195
##   M     105   70 175
##   Sum   259  111 370
# romantic and high_use
addmargins(table(alc$romantic, alc$high_use))
##      
##       FALSE TRUE Sum
##   no    173   78 251
##   yes    86   33 119
##   Sum   259  111 370
# high_use and studytime
addmargins(table(alc$high_use, alc$studytime))
##        
##           1   2   3   4 Sum
##   FALSE  56 128  52  23 259
##   TRUE   42  57   8   4 111
##   Sum    98 185  60  27 370
# high_use and goout
addmargins(table(alc$high_use, alc$goout))
##        
##           1   2   3   4   5 Sum
##   FALSE  19  82  97  40  21 259
##   TRUE    3  15  23  38  32 111
##   Sum    22  97 120  78  53 370
# proportion figures

t1 <- ggplot(data = alc, aes(x = sex, fill = high_use)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  ylab("%") +
  xlab("Gender")

t2 <- ggplot(alc, aes(romantic, fill = high_use)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  ylab("%") +
  xlab("Romantic relationship")

t3 <- ggplot(alc, aes(x = goout, fill = high_use)) + 
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  ylab("%") +
  xlab("Go out with friends")

t4 <- ggplot(alc, aes(x = studytime, fill = high_use)) + 
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  ylab("%") +
  xlab("Studytime")

# all plots in one figure ('ggarrange' is from 'ggpubr' package)
ggarrange(t1 + rremove("legend"), t2 + rremove("legend"), t3 + rremove("legend"), t4 + rremove("legend"),
          ncol = 2, nrow = 2, 
          common.legend = TRUE, legend = "right")

The tables and figure above indicate that a bigger share of males than females are high users of alcohol. Similarly, the more the students go out with friends, or the less time they use for studying, the bigger the share of high users. In contrast, the share of high users of alcohol is about the same among those who are in a romantic relationship and those who are not. Thus, these descriptive results provide support for hypotheses 1, 3 and 4 presented above, while hypothesis 2 appears rather questionable.

3.3 Logistic regression

# logistic regression model
## studytime is treated as a factor in the model
model1 <- glm(high_use ~ sex + romantic + goout + as.factor(studytime), data = alc, family = "binomial")

# summary of the results
summary(model1)
## 
## Call:
## glm(formula = high_use ~ sex + romantic + goout + as.factor(studytime), 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7282  -0.7953  -0.4800   0.8780   2.7249  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -3.1839     0.5126  -6.212 5.25e-10 ***
## sexM                    0.7096     0.2667   2.660  0.00781 ** 
## romanticyes            -0.1061     0.2747  -0.386  0.69922    
## goout                   0.7426     0.1198   6.202 5.59e-10 ***
## as.factor(studytime)2  -0.4045     0.2950  -1.371  0.17033    
## as.factor(studytime)3  -1.1405     0.4726  -2.413  0.01581 *  
## as.factor(studytime)4  -1.3334     0.6245  -2.135  0.03276 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 379.20  on 363  degrees of freedom
## AIC: 393.2
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(model1) %>% 
  exp

# compute confidence intervals (CI) for the odds ratios
CI <- confint(model1) %>%
    exp
## Waiting for profiling to be done...
# combine the odds ratios and their confidence intervals
OR_with_CI <- cbind(OR, CI)
# round the results 
round(OR_with_CI, digits = 2)
##                         OR 2.5 % 97.5 %
## (Intercept)           0.04  0.01   0.11
## sexM                  2.03  1.21   3.45
## romanticyes           0.90  0.52   1.53
## goout                 2.10  1.67   2.68
## as.factor(studytime)2 0.67  0.37   1.19
## as.factor(studytime)3 0.32  0.12   0.78
## as.factor(studytime)4 0.26  0.07   0.83

The results from the logistic regression model are presented above. The null deviance is the value for a model without any explanatory variables, and the residual deviance is the value when all explanatory variables are taken into account. Higher values indicate a worse fit and smaller values a better fit. The difference between the null deviance and the residual deviance illustrates how good the fit of the given model with explanatory variables is; the greater the difference, the better. Although the residual deviance is still quite high, the difference shows that the model with explanatory variables fits better than a model including only the intercept term. The AIC (Akaike’s information criterion) value can be used for comparing competing models (this will be discussed more below).
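The null vs. residual deviance comparison can also be made formal with a likelihood ratio test (my own addition, not part of the exercise): the difference in deviances is approximately chi-squared distributed with degrees of freedom equal to the difference in the degrees of freedom, here 369 − 363 = 6.

```r
# likelihood ratio test from the reported deviances
dev_diff <- 452.04 - 379.20   # null deviance - residual deviance
df_diff  <- 369 - 363         # difference in degrees of freedom
dev_diff                      # 72.84
pchisq(dev_diff, df = df_diff, lower.tail = FALSE)  # p-value far below 0.001
```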

The coefficients illustrate the association between the explanatory variables and the target variable. The estimates can also be used, for instance, for calculating the predicted probability of being a high user based on known values of the explanatory variables. According to the coefficient table, having a romantic relationship is not significantly associated with the use of alcohol. Instead, the results show that gender, time used for studying and going out with friends are statistically significant predictors of high alcohol consumption. However, not all of the associations are linear, which will be discussed along with the odds ratios.

The odds ratio (OR) illustrates the difference between groups in the odds of being a high user. For instance, in the case of the sex variable, females are the reference group (OR = 1.00) and the odds of males are compared to the odds of females; the odds ratio shows that males (OR = 2.03) are about twice as likely to be high users as females. Going out with friends more often increases (OR = 2.10) the likelihood of being a high user (note that the actual scale of the variable is somewhat obscure, and thus difficult to interpret). Finally, compared to students who study less than two hours, those who study 5 to 10 hours are about three times less likely to be high users (OR = 0.32), and those who study over 10 hours (OR = 0.26) are about four times less likely. Instead, those who study 2 to 5 hours are not statistically more or less likely to be high users than the reference group (the coefficient is not statistically significant and the CI of the OR includes 1).
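The odds ratios in the table are simply the exponentiated coefficients from the model summary, as computed in the chunk above; for example:

```r
# odds ratios as exponentiated log-odds coefficients from the summary
exp(0.7096)    # sexM                  -> ~2.03
exp(0.7426)    # goout                 -> ~2.10
exp(-1.1405)   # as.factor(studytime)3 -> ~0.32
```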

Predictions

According to the model presented above, all variables except ‘romantic’ had a statistically significant relationship with high/low alcohol consumption. Next, we build a new regression model without the romantic variable and examine how accurate the model’s predictions are.

# new logistic regression model
model2 <- glm(high_use ~ sex + as.factor(studytime) + goout, data = alc, family = "binomial")

# predicted probabilities and prediction (> 0.5) of high_use and add it to the data.frame
probabilities <- predict(model2, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the observed high use versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   234   25
##    TRUE     58   53
# probability table of high use vs. predictions
prob_table <- table(high_use = alc$high_use, prediction = alc$prediction) %>%
    prop.table %>%
    addmargins
round(prob_table * 100, digits = 2) # probs in % and rounded
##         prediction
## high_use  FALSE   TRUE    Sum
##    FALSE  63.24   6.76  70.00
##    TRUE   15.68  14.32  30.00
##    Sum    78.92  21.08 100.00
ggplot(alc, aes(x = probability, y = high_use, col = prediction)) +
  geom_point(alpha = 0.5, size = 3)

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Calculate the share of false predictions (false positives + false negatives)
## this can be seen from the above prob.table as well
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2243243

The first table provides a cross-tabulation of the predictions against the actual values of high_use, and the next table shows the proportion of each cell. According to the proportion table, about 77.6 % of the predictions (cells FALSE/FALSE + TRUE/TRUE) match the actual observed values, while 22.4 % of the predictions (cells FALSE/TRUE + TRUE/FALSE) are incorrect. This is also confirmed by the self-defined ‘loss function’, which gives the share of false predictions, 22.4 %. The plot should visualize the results, although it seems there is something wrong with the figure (but I can’t find an error in the code).
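The loss function result can also be read straight off the first cross-tabulation (my own check of the arithmetic): the misclassified observations are the 25 false positives and 58 false negatives out of 370 students.

```r
# share of false predictions from the confusion table above
(25 + 58) / 370   # 0.2243243, matching loss_func
```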

3.4 Cross-validation

3.4.1 10-fold cross-validation of the model (bonus)

# compute the average number of wrong predictions in the (training) data
## loss func is defined in previous chunk
nwrong_train <- loss_func(class = alc$high_use, prob = alc$probability)
# mean error in in the training data
nwrong_train
## [1] 0.2243243
# 10-fold cross-validation
crossvalid <- cv.glm(data = alc, cost = loss_func, glmfit = model2, K = 10)
# the mean prediction error for the testing data
crossvalid$delta[1]
## [1] 0.2486486

The mean error in the training data is 0.224. The mean prediction error for the testing data in the 10-fold cross-validation is around 0.25. The mean prediction error of the DataCamp model was about 0.26, suggesting that the model presented here has slightly better test performance; i.e. it is more accurate in predicting high consumption of alcohol.

3.4.2 Finding more parsimonious model (super-bonus)

Here I utilize backward elimination based on the AIC (Akaike’s Information Criterion) for selecting the explanatory variables for the model. The AIC index takes into account both the statistical goodness of fit and the number of variables in the model by adding a penalty for a greater number of variables. Among competing models, lower AIC values are preferred; i.e. the aim is to achieve as low an AIC value as possible by excluding variables one at a time. The elimination ends when removing further variables would no longer improve the AIC score. I start by selecting 10 variables and proceed with the elimination from there.
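Concretely, for a GLM the AIC reported by step() is the residual deviance plus twice the number of estimated parameters. As a hand check using the values reported in the elimination trace below, the final model has a deviance of 365.69 and six coefficients:

```r
# AIC = residual deviance + 2 * (number of estimated parameters)
deviance_final <- 365.69
k <- 6   # intercept + sexM + absences + failures + goout + activitiesyes
deviance_final + 2 * k   # 377.69, matching the step() output
```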

# logistic regression model
model3 <- glm(high_use ~ sex + age + Pstatus + absences + failures + schoolsup + as.factor(studytime) + goout + activities + freetime, data = alc, family = "binomial")

step(model3, direction = "backward")
## Start:  AIC=386.87
## high_use ~ sex + age + Pstatus + absences + failures + schoolsup + 
##     as.factor(studytime) + goout + activities + freetime
## 
##                        Df Deviance    AIC
## - schoolsup             1   360.87 384.87
## - Pstatus               1   360.89 384.89
## - age                   1   361.00 385.00
## - as.factor(studytime)  3   365.07 385.07
## - freetime              1   361.34 385.34
## <none>                      360.87 386.87
## - failures              1   363.14 387.14
## - activities            1   363.79 387.79
## - sex                   1   370.47 394.47
## - absences              1   372.93 396.93
## - goout                 1   392.92 416.92
## 
## Step:  AIC=384.87
## high_use ~ sex + age + Pstatus + absences + failures + as.factor(studytime) + 
##     goout + activities + freetime
## 
##                        Df Deviance    AIC
## - Pstatus               1   360.89 382.89
## - age                   1   361.00 383.00
## - as.factor(studytime)  3   365.08 383.08
## - freetime              1   361.34 383.34
## <none>                      360.87 384.87
## - failures              1   363.15 385.15
## - activities            1   363.80 385.80
## - sex                   1   370.75 392.75
## - absences              1   372.93 394.93
## - goout                 1   392.92 414.92
## 
## Step:  AIC=382.89
## high_use ~ sex + age + absences + failures + as.factor(studytime) + 
##     goout + activities + freetime
## 
##                        Df Deviance    AIC
## - age                   1   361.03 381.03
## - as.factor(studytime)  3   365.09 381.09
## - freetime              1   361.38 381.38
## <none>                      360.89 382.89
## - failures              1   363.18 383.18
## - activities            1   363.80 383.80
## - sex                   1   370.77 390.77
## - absences              1   372.99 392.99
## - goout                 1   392.99 412.99
## 
## Step:  AIC=381.03
## high_use ~ sex + absences + failures + as.factor(studytime) + 
##     goout + activities + freetime
## 
##                        Df Deviance    AIC
## - as.factor(studytime)  3   365.16 379.16
## - freetime              1   361.50 379.50
## <none>                      361.03 381.03
## - failures              1   363.56 381.56
## - activities            1   364.01 382.01
## - sex                   1   371.03 389.03
## - absences              1   373.55 391.55
## - goout                 1   394.24 412.24
## 
## Step:  AIC=379.16
## high_use ~ sex + absences + failures + goout + activities + freetime
## 
##              Df Deviance    AIC
## - freetime    1   365.69 377.69
## <none>            365.16 379.16
## - failures    1   369.03 381.03
## - activities  1   369.14 381.14
## - absences    1   379.72 391.72
## - sex         1   380.24 392.24
## - goout       1   398.78 410.78
## 
## Step:  AIC=377.69
## high_use ~ sex + absences + failures + goout + activities
## 
##              Df Deviance    AIC
## <none>            365.69 377.69
## - activities  1   369.50 379.50
## - failures    1   369.59 379.59
## - absences    1   380.01 390.01
## - sex         1   382.11 392.11
## - goout       1   404.79 414.79
## 
## Call:  glm(formula = high_use ~ sex + absences + failures + goout + 
##     activities, family = "binomial", data = alc)
## 
## Coefficients:
##   (Intercept)           sexM       absences       failures          goout  
##      -3.99781        1.05729        0.08345        0.45660        0.72059  
## activitiesyes  
##      -0.51287  
## 
## Degrees of Freedom: 369 Total (i.e. Null);  364 Residual
## Null Deviance:       452 
## Residual Deviance: 365.7     AIC: 377.7

The backward elimination suggests that we should keep five of the original 10 variables: sex, failures, activities, absences and goout. Interestingly, studytime was excluded! Next, I will run the logistic model with those variables and conduct a 10-fold cross-validation.

model4 <- glm(high_use ~ sex + failures + activities + absences + goout,
              data = alc, family = "binomial")

summary(model4)
## 
## Call:
## glm(formula = high_use ~ sex + failures + activities + absences + 
##     goout, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2685  -0.7612  -0.5209   0.7648   2.4960  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -3.99781    0.48451  -8.251  < 2e-16 ***
## sexM           1.05729    0.26683   3.962 7.42e-05 ***
## failures       0.45660    0.23337   1.957  0.05040 .  
## activitiesyes -0.51287    0.26475  -1.937  0.05272 .  
## absences       0.08345    0.02258   3.695  0.00022 ***
## goout          0.72059    0.12300   5.858 4.67e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 365.69  on 364  degrees of freedom
## AIC: 377.69
## 
## Number of Fisher Scoring iterations: 4
# 10-fold cross-validation
crossvalid2 <- cv.glm(data = alc, cost = loss_func, glmfit = model4, K = 10)
# the mean prediction error for the testing data
crossvalid2$delta[1]
## [1] 0.1972973
# visual
probabilities_m4 <- predict(model4, type = "response")
alc <- mutate(alc, probability_m4 = probabilities_m4)
alc <- mutate(alc, prediction_m4 = probability_m4 > 0.5)

ggplot(alc, aes(x = probability_m4, y = high_use, col = prediction_m4)) +
  geom_point(alpha = 0.5, size = 3)

The table above presents the summary of the fourth model. According to the 10-fold cross-validation, the mean prediction error for the testing data is around 0.20, suggesting that this model has better test performance than model 3 examined above. The figure provides graphical evidence for this conclusion. Moreover, the smaller residual deviance and the bigger difference between the null and residual deviances suggest that this new model is better than the previous one.
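The cost argument of cv.glm above points to loss_func, which was defined earlier in the document. A typical definition for a binary classifier of this kind is the mean misclassification rate at a 0.5 cutoff; a sketch (the exact definition used above may differ):

```r
# mean prediction error: the share of cases where the classification
# (probability > 0.5) disagrees with the observed class (0/1)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - (prob > 0.5))
  mean(n_wrong)
}

# two of these four predictions fall on the wrong side of the 0.5 cutoff
loss_func(class = c(0, 0, 1, 1), prob = c(0.2, 0.8, 0.6, 0.4))  # 0.5
```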

Thanks for reading and for the upcoming feedback!


Chapter 4: Clustering and classification

date()
## [1] "Thu Nov 25 11:49:22 2021"

My fourth week in short

4.1 Data

# access to packages
library(MASS)
library(dplyr)
library(tidyr)
library(ggplot2)
library(corrplot)

# get the data (included in MASS)
data(Boston)

# structure of the data
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The Boston data set utilized here includes ‘Housing Values in Suburbs of Boston’, that is, different variables relating to housing in the city of Boston. The data consists of 506 observations of 14 numeric (or integer) variables (i.e. 506 rows and 14 columns). The variables are described in the documentation of the data set.

4.2 Overview of the Boston data

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# let's use gather from tidyr-package; gather returns key-value pairs of variables
# draw a histogram of each variable
gather(Boston) %>% 
  ggplot(aes(value)) + 
  facet_wrap("key", scales = "free") +
  geom_histogram()

# construct a correlation matrix and round the results
cor_matrix <- cor(Boston) %>%
    round(digits = 2)

# visualize the correlation matrix
corrplot(cor_matrix, method = "circle", type = "upper", cl.pos = "r", tl.pos = "d", tl.cex = 0.8)

Summaries of the variables and the histograms presented above illustrate that the variables in the data set have quite skewed distributions and quite different scales from each other.

The correlation matrix shows that many of the variables are highly correlated. The correlation coefficients vary from -1 to 1. A positive coefficient indicates that high values of variable ‘X’ are associated with high values of variable ‘Y’. Respectively, a negative coefficient indicates that high values of ‘X’ are associated with low values of ‘Y’.
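To pick out the strongest pairs as numbers rather than circle sizes, the correlation matrix can be filtered; a small sketch (the 0.7 cutoff is an arbitrary choice):

```r
library(MASS)   # Boston data
data(Boston)

cor_matrix <- round(cor(Boston), digits = 2)

# turn the matrix into (Var1, Var2, r) rows and keep each pair once
cor_pairs <- as.data.frame(as.table(cor_matrix))
names(cor_pairs)[3] <- "r"
cor_pairs <- subset(cor_pairs,
                    as.integer(Var1) < as.integer(Var2) & abs(r) >= 0.7)

# strongest relationships first, e.g. rad-tax at r = 0.91
cor_pairs[order(-abs(cor_pairs$r)), ]
```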

4.3 Standardized dataset

First I’ll scale the Boston data, that is, standardize each variable, and save the standardized variables as a data.frame instead of a matrix. When variables are centered, their mean is adjusted to zero, as can be seen from the summaries of the scaled variables shown below.

After scaling the data, I will create a new factor variable from the ‘crime rate per capita’ variable and use the quantiles as cut points. A summary of the new ‘crime’ variable can be found below.

Finally, I will split the scaled data into training and testing data sets, which are then used in the next subchapter.
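Standardization here means subtracting each column's mean and dividing by its standard deviation; a quick sanity check of what scale() does, on one variable:

```r
library(MASS)   # Boston data
data(Boston)

# scale() and the manual (x - mean) / sd give the same result
scaled_crim <- as.vector(scale(Boston$crim))
manual_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)

all.equal(scaled_crim, manual_crim)  # TRUE
round(mean(manual_crim), 10)         # 0: centering sets the mean to zero
```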

# scale the Boston data and transform the matrix to data.frame
boston_scaled <- Boston %>%
  scale() %>%
  as.data.frame()
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# pick the quantiles of crim
bins <- quantile(boston_scaled$crim)
# create label names
crim_lab <- c("low", "med_low", "med_high", "high")
# create new factor variable
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = crim_lab)
# add the new variable to the data frame and remove the old one
boston_scaled$crime <- crime
# NB: specify dplyr:: explicitly below; otherwise an error may occur when knitting the index file, because MASS also provides a select() function
boston_scaled <- dplyr::select(boston_scaled, -crim)
## alternative way: boston_scaled <- subset(boston_scaled, select = -crim)

# summary of the new variable
table(boston_scaled$crime)
## 
##      low  med_low med_high     high 
##      127      126      126      127
# pick the number of total observations in the data
n <- nrow(boston_scaled)
# take a sample of 80% of the observations, i.e. randomly pick row numbers between 1 and 0.8 x rows
train_indexes <- sample(n, size = n * 0.8)
# create the training set by using the sampled indexes for rows
train_set <- boston_scaled[train_indexes,]
# create the testing set by excluding the rows used for the training set
test_set <- boston_scaled[-train_indexes,]
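Note that sample() draws a different split every time the document is knitted; to make the split reproducible one would fix the random seed first. A sketch (the seed value 2021 is an arbitrary choice):

```r
library(MASS)   # Boston data
data(Boston)
boston_scaled <- as.data.frame(scale(Boston))

set.seed(2021)   # arbitrary seed, only to make the split repeatable
n <- nrow(boston_scaled)
train_indexes <- sample(n, size = floor(n * 0.8))

train_set <- boston_scaled[train_indexes, ]
test_set  <- boston_scaled[-train_indexes, ]
c(train = nrow(train_set), test = nrow(test_set))  # 404 and 102
```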

4.4 Linear discriminant analysis (LDA)

Regarding this part of the exercise, there were multiple problems with the DataCamp exercises and the video was not available. In any case, let’s fit the LDA using the ‘crime’ variable as the target variable and all the other variables as predictors. Below I show the results in a table and in a biplot. No interpretation was asked for in the instructions, so none is provided.

# fit the linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train_set)
# print the LDA results
lda.fit
## Call:
## lda(crime ~ ., data = train_set)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2524752 0.2673267 0.2400990 0.2400990 
## 
## Group means:
##                  zn      indus        chas        nox          rm        age
## low       0.8820958 -0.9200942 -0.19513102 -0.8478258  0.44761627 -0.8561416
## med_low  -0.1495827 -0.2656327 -0.05360128 -0.5463269 -0.16142707 -0.3324223
## med_high -0.3891090  0.1518098  0.21473488  0.3925735  0.02051922  0.3980438
## high     -0.4872402  1.0172187 -0.15056308  1.0341139 -0.38636894  0.8155224
##                 dis        rad        tax     ptratio      black       lstat
## low       0.8682244 -0.6891247 -0.7351723 -0.39381703  0.3755724 -0.76149283
## med_low   0.3077654 -0.5554496 -0.4603125 -0.09322631  0.3244670 -0.10037260
## med_high -0.3631561 -0.4597331 -0.3584517 -0.21660391  0.0973394  0.04914639
## high     -0.8471459  1.6371072  1.5133254  0.77958792 -0.6831779  0.94495303
##                  medv
## low       0.513326282
## med_low  -0.002157566
## med_high  0.076242944
## high     -0.799873264
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## zn       0.049486013  0.65511306 -0.97484049
## indus    0.101881230 -0.21572805  0.36337221
## chas    -0.168419665 -0.15680650  0.04840405
## nox      0.276243540 -0.75327521 -1.50239868
## rm      -0.094489913 -0.02632840 -0.18406758
## age      0.179455348 -0.30731359 -0.18472257
## dis     -0.044963835 -0.18616611  0.04837891
## rad      3.727372902  0.89365941 -0.12932975
## tax     -0.005777846  0.03142512  0.72642830
## ptratio  0.091156742 -0.01699341 -0.44792743
## black   -0.096586342  0.09139483  0.20082253
## lstat    0.233271461 -0.23267700  0.35329598
## medv     0.170234430 -0.42102266 -0.17972763
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9565 0.0305 0.0130
# draw LDA biplot
classes <- as.numeric(train_set$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)

Next, I will save the observed frequencies of the crime categories from the test data and then remove the variable, after which I test how well the fitted model predicts the values. The results are shown below. The classifier, that is the fitted model, did not predict the crime rates perfectly, although the predictions are somewhat accurate. Most of the predictions match the observed classes, and those that do not fall mostly into the adjacent classes (i.e. the predictions are not right but close).

# save the crime variable separately
correct_cases <- test_set$crime
# remove crime variable from the test data
## MASS also has a select(), so use dplyr::select explicitly
test_set <- dplyr::select(test_set, -crime)

# use LDA model for predicting crime cases
lda.predict <- predict(lda.fit, newdata = test_set)
# cross tabulate the correct cases with the predictions
table(correct = correct_cases, predicted = lda.predict$class)
##           predicted
## correct    low med_low med_high high
##   low       17       7        1    0
##   med_low    4      12        2    0
##   med_high   0       7       19    3
##   high       0       0        0   30
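The overall accuracy can be computed from the diagonal of the cross-tabulation above (these counts come from one random split, so they vary between runs):

```r
# confusion matrix copied from the run above; rows = correct, cols = predicted
crime_levels <- c("low", "med_low", "med_high", "high")
conf <- matrix(c(17,  7,  1,  0,
                  4, 12,  2,  0,
                  0,  7, 19,  3,
                  0,  0,  0, 30),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct = crime_levels, predicted = crime_levels))

# correct classifications sit on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 3)  # 0.765, i.e. roughly three out of four test cases correct
```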

4.5 K-means clustering

Here, I reload the Boston data and recreate the scaled data set, in which the variables are standardized so that the distances are comparable. Next, I calculate the euclidean distances between the observations. A summary of the calculated distances is shown below.

# clean the environment (i.e. data.frames and variables etc.)
rm(list = ls())

# reload the Boston dataset
data(Boston)

# scale the Boston data and transform the matrix to data.frame
boston_scaled <- Boston %>%
  scale() %>%
  as.data.frame()

dist_euc <- dist(boston_scaled, method = "euclidean")
summary(dist_euc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
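For two observations i and j, dist() with method = "euclidean" returns sqrt(sum_k (x_ik - x_jk)^2) over the variables k; a quick check for the first two rows:

```r
library(MASS)   # Boston data
data(Boston)
boston_scaled <- as.data.frame(scale(Boston))

# manual euclidean distance between observations 1 and 2
d_manual <- sqrt(sum((boston_scaled[1, ] - boston_scaled[2, ])^2))
# the same entry from the full distance matrix
d_dist <- as.matrix(dist(boston_scaled, method = "euclidean"))[1, 2]

all.equal(d_manual, d_dist)  # TRUE
```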

After that, K-means clustering is conducted: first with three centers, and then, after examining the total within-cluster sum of squares (WCSS), K-means clustering is run again with the two centers suggested by the WCSS investigation. Below, the results of the two-center clustering are visualized in scatter plot matrices in which the clusters are colored; the first plot shows all of the variables (which is quite unreadable) and the following three plots show parts of the big picture; in all of them the two more or less distinct clusters are visible.

# k-means clustering with 3 centers
km_boston <- kmeans(boston_scaled, centers = 3)

# examination of WCSS, that is, deciding the optimal number of centers
# set seed
set.seed(123)
# determine the number of clusters in the examination
k_max <- 10
# calculate the total WCSS
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# the examination of WCSS and the visualization suggest that two centers would be optimal
km_boston <- kmeans(boston_scaled, centers = 2)

# visualize the k-means clustering results
pairs(boston_scaled, col = km_boston$cluster)

pairs(boston_scaled[1:5], col = km_boston$cluster)

pairs(boston_scaled[6:10], col = km_boston$cluster)

pairs(boston_scaled[11:14], col = km_boston$cluster)

THAT’S ALL!


Chapter 5: Dimensionality reduction techniques

date()
## [1] "Thu Nov 25 11:49:32 2021"
library(dplyr)
library(ggplot2)
library(GGally)
library(corrplot)
library(tidyr)

5.1 Data

The data set used here is drawn from the Human Development Index and the Gender Development Index. More information can be found on the United Nations Development Programme’s webpage. The prepared data set examined here consists of 155 observations (that is, countries) of 8 variables.

Description of the variables in the prepared data:

  • GNIperCap = Gross National Income per capita
  • LifeExp = Life expectancy at birth
  • ExpEdu = Expected years of schooling
  • MatMortalityRate = Maternal mortality ratio
  • AdoBirthRate = Adolescent birth rate
  • FemalesParliament = Percentage of female representatives in parliament
  • SecondEduRatio = Proportion of females with at least secondary education / Proportion of males with at least secondary education
  • LabourRatio = Proportion of females in the labour force / Proportion of males in the labour force

The structure of the data and summaries of the variables are shown below.

# read the data from a local file
human <- read.table("data/human.csv", sep = ";")

str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ GNIperCap        : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ LifeExp          : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ ExpEdu           : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ MatMortalityRate : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ AdoBirthRate     : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ FemalesParliament: num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
##  $ LabourRatio      : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ SecondEduRatio   : num  1.007 0.997 0.983 0.989 0.969 ...
summary(human)
##    GNIperCap         LifeExp          ExpEdu      MatMortalityRate
##  Min.   :   581   Min.   :49.00   Min.   : 5.40   Min.   :   1.0  
##  1st Qu.:  4198   1st Qu.:66.30   1st Qu.:11.25   1st Qu.:  11.5  
##  Median : 12040   Median :74.20   Median :13.50   Median :  49.0  
##  Mean   : 17628   Mean   :71.65   Mean   :13.18   Mean   : 149.1  
##  3rd Qu.: 24512   3rd Qu.:77.25   3rd Qu.:15.20   3rd Qu.: 190.0  
##  Max.   :123124   Max.   :83.50   Max.   :20.20   Max.   :1100.0  
##   AdoBirthRate    FemalesParliament  LabourRatio     SecondEduRatio  
##  Min.   :  0.60   Min.   : 0.00     Min.   :0.1857   Min.   :0.1717  
##  1st Qu.: 12.65   1st Qu.:12.40     1st Qu.:0.5984   1st Qu.:0.7264  
##  Median : 33.60   Median :19.30     Median :0.7535   Median :0.9375  
##  Mean   : 47.16   Mean   :20.91     Mean   :0.7074   Mean   :0.8529  
##  3rd Qu.: 71.95   3rd Qu.:27.95     3rd Qu.:0.8535   3rd Qu.:0.9968  
##  Max.   :204.80   Max.   :57.50     Max.   :1.0380   Max.   :1.4967

The graphical overview of the data is presented below. The scatter plot matrix shows the distributions of the variables and illustrates their relationships. A better overview of the relationships between the variables is provided by the correlation matrix, which shows that many of the variables are highly correlated with each other. The correlation coefficients vary from -1 to 1. A positive coefficient indicates that high values of variable ‘X’ are associated with high values of variable ‘Y’. Respectively, a negative coefficient indicates that high values of ‘X’ are associated with low values of ‘Y’.

# scatter plot matrix using ggplot2 and GGally -packages
ggpairs(human, lower = list(combo = wrap("facethist", bins = 20)))

# correlation matrix
cor(human) %>%
  corrplot(method = "circle", type = "upper", cl.pos = "r", tl.cex = 0.8)

5.2 Principal Component Analysis

Next, I will perform principal component analysis (PCA) on the human data: first with unstandardized data and after that with standardized data set.

5.2.1 Unstandardized data

# PCA with SVD method (unstandardized data)
pca_human_ustd <- prcomp(human)
# summary of PCA results
smry <- summary(pca_human_ustd)
smry
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
# rounded percentages of variance captured by each PC
pca_pr <- round(smry$importance[2, ] * 100, digits = 2)
# create object pc_lab (variance captured) to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# biplot of the results
biplot(pca_human_ustd, 
       choices = 1:2, 
       cex = c(0.7, 0.9), 
       col = c("black", "red"),
       xlab = pc_lab[1], ylab = pc_lab[2])

The results from PCA performed with unstandardized data are shown above in the summary table and graphically in the biplot. They clearly show that the first principal component captures (misleadingly) almost all of the variance (99.99%), which is due to the use of unstandardized data. PCA is sensitive to the relative scaling of the data: it assumes that variables with larger variance are more important than those with smaller variance, which is not the case here, and thus the data must be standardized before performing PCA.
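The scale sensitivity is easy to demonstrate with simulated data: when one column's variance dwarfs the others (as GNI per capita does here), the first PC essentially reproduces that column. A sketch with made-up numbers on roughly the same scales as the human variables:

```r
set.seed(42)  # arbitrary seed for reproducible simulated data
sim <- data.frame(gni  = rnorm(155, mean = 17000, sd = 18000),  # huge scale
                  life = rnorm(155, mean = 72,    sd = 8),
                  edu  = rnorm(155, mean = 13,    sd = 3))

# proportion of variance captured by the first principal component
pve1 <- function(x) summary(prcomp(x))$importance[2, 1]

pve1(sim)         # close to 1: PC1 is essentially the large-scale column
pve1(scale(sim))  # roughly 1/3 for three (independent) standardized columns
```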

5.2.2 Standardized data

The results from PCA of standardized data are presented below. Now the first principal component captures about 54 percent of the variance and the second component about 16 percent. In the biplot the arrows can be interpreted as follows:

  • the angle between the arrows illustrates the correlation between the variables
  • the angle between a variable and a PC axis illustrates the correlation between the two
  • the length of the arrow is proportional to the standard deviation of the variable

From the figure one can notice, for instance, that GNI per capita, life expectancy, expected education and the gender ratio in secondary education are all highly positively correlated with each other. Respectively, maternal mortality is positively correlated with the adolescent birth rate, and these two are negatively correlated with the variables mentioned in the previous sentence (e.g. maternal mortality is negatively correlated with life expectancy). Moreover, the figure illustrates that all of these features contribute to the first principal component. In contrast, the variables FemalesParliament and LabourRatio contribute to the second principal component, and the vertical arrows illustrate that the share of women in the parliament correlates positively with the labour ratio between the genders.
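These contributions can also be read numerically from the rotation matrix (the loadings): variables with large absolute loadings on a component drive that component. Since the human file is local, a sketch on the built-in USArrests data:

```r
# PCA on standardized data, as in the analysis below
pca <- prcomp(USArrests, scale. = TRUE)

# loadings: rows are variables, columns are principal components
round(pca$rotation[, 1:2], 2)

# variables ordered by the size of their contribution to PC1
sort(abs(pca$rotation[, 1]), decreasing = TRUE)
```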

# standardize the human data and save it as data frame
human_std <- as.data.frame(scale(human))
# PCA with SVD method (standardized data)
pca_human_std <- prcomp(human_std)
# summary of PCA results
smry_std <- summary(pca_human_std)

# rounded percentages of variance captured by each PC
pca_pr_std <- round(smry_std$importance[2, ] * 100, digits = 2)
# create object pc_lab (variance captured) to be used as axis labels
pc_lab_std <- paste0(names(pca_pr_std), " (", pca_pr_std, "%)")

# biplot of the results
biplot(pca_human_std, 
       choices = 1:2, 
       cex = c(0.5, 0.7), 
       col = c("black", "red"),
       xlab = pc_lab_std[1], ylab = pc_lab_std[2])

5.3 Multiple Correspondence Analysis

The ‘tea’ data used here consists of 300 observations of 36 variables. I do not see any reason to visualize the whole data set, so I just don’t. Instead, I pick a few of the variables. The summary and visualization of the subset are shown below.

# load the data set from FactoMineR-package
library(FactoMineR)
data("tea")
# explore the data
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# select a few columns and create a new data set
keep_vars <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, all_of(keep_vars))
# structure and summary of the new dataset
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualization of the new data set
gather(tea_time) %>% ggplot(aes(value)) + 
  facet_wrap("key", scales = "free") +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Next, I will perform Multiple Correspondence Analysis (MCA). The visualization of the MCA results is shown below.

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the MCA model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
# plot the MCA results
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")

In the figure the variable categories are drawn on the first two dimensions. The distance between the categories provides a measure of their similarity. The vertical dimension relates, at least to some extent, to where the tea is bought. Respectively, the horizontal dimension seems to relate to how the tea is consumed (e.g. tea bags, unpackaged etc.). From the figure one can interpret, for example, that buying tea from a tea shop is closer to using unpackaged tea than to using tea bags, and similarly that getting tea from a chain store is closer to using tea bags than to using unpackaged tea.
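The distances read off the plot can also be checked numerically: the category coordinates on the first two dimensions are stored in the fitted object. A sketch repeating the fit from above:

```r
library(FactoMineR)
data(tea)

# same subset of columns as in the analysis above
keep_vars <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- tea[, keep_vars]

mca <- MCA(tea_time, graph = FALSE)

# category coordinates on the first two MCA dimensions
round(mca$var$coord[, 1:2], 2)
```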